ABSTRACT

This paper introduces a novel retinal-inspired filter which is
applied on video streams. We mathematically prove that, under
specific assumptions, the spatiotemporal convolution turns into a
spatial convolution with a short-lifespan temporal kernel. As a
consequence, the filter is applied on each image of the video
stream separately. We analyze how each image is decomposed into a
group of subbands, each of which approximates the image and
provides a different kind of information. Afterwards, we propose
an algorithm to reconstruct each image by exploiting the group of
subbands. Finally, we support our mathematical proofs with
numerical simulations which show the relevance of our study.

Index Terms — Retinal-inspired processing, non-separable 
spatiotemporal filter, frame theory, dual frame. 

1. INTRODUCTION 

Research on image and video compression algorithms remains one of
the most challenging scientific fields. This is due to the fact
that images and videos are widely used not only for personal
purposes but also for security reasons. As a result, a large
amount of data needs to be transmitted and/or saved in real time
while satisfying multiple constraints. These constraints are
mostly related to the network bandwidth, the memory of the system,
the distortion of the data or the energy of the system. Satisfying
all these constraints at once would give an optimal solution but,
in practice, one should seek a relevant trade-off between them.

Closed-Circuit Television (CCTV) systems are one of the video
processing applications which have contributed to the exponential
increase of data. The most important challenge in this kind of
system is to minimize the energy consumption, which is closely
related to the compression rate and the bandwidth of the real-time
transmission. At the same time, whatever the bandwidth is, it is
always necessary to transmit only the most informative and
meaningful data such that the reconstruction quality is the best
possible. As a result, it would be advantageous for CCTV systems
if we could propose an algorithm which saves power.

To deal with the above problem, we were inspired by the visual
system in order to propose an alternative decomposition model for
video. The retinal function has been explicitly modeled by
neuroscientists and the experimental results have shown that this
should be an efficient "compression" model, especially with
respect to energy minimization [1,2]. This is due to the fact that
the retina is a layered structure of different kinds of cells, the
number of which decreases as they get closer to the optic
nerve [3].

In this paper, our goal is to study the retinal-inspired
transformation from the signal processing point of view in order
to save power and to set it as the basis for our future
bio-inspired dynamic codec. The first attempt at modeling this
kind of filter was proposed as a bio-inspired codec of natural
images by [4]. The authors tried to approximate the spatiotemporal
variations of the retinal filtering using a Difference of
Gaussians (DoG) pyramid based on [5] and [6], considering at the
same time that each layer appears at a different moment according
to an exponential temporal function. We improve this filter by
explicitly taking time into account in the design of our novel
non-Separable sPAtioteMporal (non-SPAM) filter. The advantage of
the non-SPAM filter is that, when a stimulus appears, its non-SPAM
transformation is based not only on the spatial neighborhood at
the given time but also on previous times. As a result, this is a
spatiotemporal transformation which enriches the details of the
signal.

In section 2, we introduce the non-SPAM filter and ex- 
plain its bio-inspired nature. Then, in section 3, we introduce 
how the filter is able to decompose the input video. We prove 
that the non-SPAM filter is invertible in section 4. In section 
5, we propose the non-SPAM synthesis based on the frame 
theory. The numerical results are given in section 6. Section 
7 concludes the paper. 

2. OBJECTIVE AND RETINAL MODEL 

The general aim of this study is to introduce a novel codec for
videos, captured for surveillance and/or security reasons, which
need to be transmitted through the network to a client who is
going to display and analyze the scenes (Fig. 1). The variations
of the network bandwidth, which depend on the location of the
captured area, are combined with the complexity of the scene. As a
result, it is necessary to build a special architecture which
stands as a trade-off between them. For that reason, we have been
inspired by the visual system, which codes the luminance of the
light that reaches the eyes and converts it into spike trains
(electrical impulses) which include all the necessary information
of the input signal. It seems that this kind of code is efficient
enough to be used in the reconstruction of the signal which is
necessary in image processing.


Fig. 1: Non-SPAM compression schema. A video stream is captured
by a CCTV system. Each image is filtered by the non-SPAM in order
to be transmitted to the user where it is decoded and displayed.
The time $\Delta t$ between two images equals the lifetime of the
temporal filters $R_c(u)$ and $R_s(u)$ which stand for the
temporal behavior of the non-SPAM and tune the spatial changes of
the filter.

Our primary goal is to mimic the anatomy of the retina and the
functions of each group of retinal cells in terms of signal
processing. Based on [2, 7] we assume that the group of cells
which form the outer plexiform layer (photoreceptors, horizontal
and bipolar cells) receives the light and spatially decomposes the
signal, with respect to their sensitivity, into blurred versions.
Each of these blurred versions is temporally enriched in details
while the signal is transmitted on the way to the ganglion cells.
These are the features that our filter, the non-Separable
sPAtioteMporal (non-SPAM) filter, tries to mimic by having a
spatial behavior which varies with respect to time.

2.1. Retinal Model

We consider that a video consists of $N$ different images, each of
which is generated at a specific time $g_i$ and lasts until time
$g_{i+1}$ when the next image appears. Throughout the paper, we
consider that a video is composed of images instead of frames in
order to avoid any confusion with the frame theory vocabulary. Let
us define a video in continuous time:

$$V(x,t) = \sum_{i=1}^{N} f_i(x)\, \mathbb{1}_{[g_i, g_{i+1}]}(t), \tag{1}$$

where $x \in \mathbb{R}^2$, $t \in [0, T]$, $T \in \mathbb{R}_+$
is the length of the video, $f_i(x)$ stands for the $i$-th image
of the video, $N$ is the total number of images which form the
video stream and $\mathbb{1}_{[g_i, g_{i+1}]}(t)$ is the indicator
function which equals 1 if $g_i \leq t \leq g_{i+1}$, and 0
otherwise. The ideal spatiotemporal convolution of the non-SPAM
and the video results in the function $A(x,t)$, which is called
the activation degree and is defined by:

$$A(x,t) = K(x,t) \overset{x,t}{*} V(x,t), \tag{2}$$

where $\overset{x,t}{*}$ is the convolution with respect to space
and time. The non-SPAM filter mimics the function of the outer
plexiform layer of the retina. The space stands for the spatial
transformation of the receptors and the time for the temporal
improvement of the initial transform by the center-surround
structure of horizontal and bipolar cells [7]. With this filter we
obtain a retinal-inspired image decomposition instead of the
conventional ones, i.e., DCT [8], DWT [9] or filter banks [5].
Based on [2], we define the non-SPAM filter in continuous time and
space as:

$$K(x,t) = C(x,t) - S(x,t), \tag{3}$$

where $C(x,t)$ and $S(x,t)$ are the center and the surround
spatiotemporal filters given by (4) and (5) respectively:

$$C(x,t) = w_c\, G_{\sigma_c}(x)\, W(t), \tag{4}$$

$$S(x,t) = w_s\, G_{\sigma_s}(x)\, (W \overset{t}{*} E_{\tau_S})(t), \tag{5}$$

where $w_c$ and $w_s$ are constant parameters, $G_{\sigma_c}$ and
$G_{\sigma_s}$ are spatial Gaussian filters standing for the
center and surround areas respectively, and $E_{\tau_S}$ is an
exponential temporal filter. The center temporal filter $W(t)$ is
given by:

$$W(t) = E_{\tau_G, n} \overset{t}{*} (\delta_0 - w_c E_{\tau_C})(t), \tag{6}$$

where the gamma temporal filter $E_{\tau, n}(t)$ is defined by:

$$E_{\tau, n}(t) = \frac{t^n \exp(-t/\tau)}{\tau^{n+1}}, \tag{7}$$

with $n \in \mathbb{N}$, $\tau$ is a constant parameter
($E_{\tau, n}(t) = 0$ for $t < 0$), $E_{\tau_C}$ is an exponential
temporal filter, $\delta_0$ is the Dirac function and
$\overset{t}{*}$ stands for the temporal convolution. In case that
$n = 0$, the gamma filter turns into an exponential filter. The
convolution of the temporal filter $W(t)$ with the exponential
filter $E_{\tau_S}(t)$ is related to the delay in the appearance
of the surround temporal filter with respect to the center one.


3. RETINAL MODEL ANALYSIS

The calculation of the activation degree $A(x,t)$ in (2) applied
to the video $V(x,t)$ in (1) turns into a spatial convolution with
a time-varying kernel as proved in the following proposition.
To simplify the calculation, it is assumed that
$g_{i+1} - g_i = \Delta t$ is constant for all $i = 1, \ldots, N$.




Proposition 1. For all $t < g_1$, the activation degree $A(x,t)$
in (2) applied on $V(x,t)$ in (1) is $A(x,t) = 0$. For
$t \geq g_1$, $A(x,t)$ can be rewritten as:

$$A(x,t) = \sum_{i=1}^{N} \phi(x, t - g_i) * f_i(x), \tag{8}$$

where $\phi(x,u)$ is a spatial DoG filter weighted by two temporal
filters $R_c(u)$ and $R_s(u)$ satisfying:

$$\phi(x,u) = w_c\, G_{\sigma_c}(x)\, R_c(u) - w_s\, G_{\sigma_s}(x)\, R_s(u), \tag{9}$$

$$R_c(u) = \int_{\max\{0,\, u - \Delta t\}}^{u} W(s)\, ds, \tag{10}$$

$$R_s(u) = \int_{\max\{0,\, u - \Delta t\}}^{u} (W \overset{t}{*} E_{\tau_S})(s)\, ds, \tag{11}$$

and $R_c(u) = R_s(u) = 0$ for $u < 0$.
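
The temporal weights in (10) and (11) can be evaluated numerically. The following is a minimal sketch, not the authors' implementation: it samples the gamma filter exactly as printed in (7), builds $W(t)$ from (6) by expanding the Dirac term, and integrates over the sliding window. The function names, the time step `dt` and the horizon `t_max` are assumptions made for illustration; the default parameter values are the ones read from Section 6.

```python
import numpy as np

def gamma_filter(t, tau, n):
    """E_{tau,n}(t) = t^n exp(-t/tau) / tau^(n+1) for t >= 0, and 0 for t < 0 (eq. 7)."""
    t = np.asarray(t, dtype=float)
    tp = np.clip(t, 0.0, None)                      # avoid overflow for negative times
    return np.where(t >= 0.0, tp**n * np.exp(-tp / tau) / tau**(n + 1), 0.0)

def causal_conv(a, b, dt):
    """Riemann-sum approximation of the temporal convolution (a * b)(t) on the grid."""
    return np.convolve(a, b)[: len(a)] * dt

def windowed_integral(w, dt, delta_t):
    """Integral of w over [max(0, u - delta_t), u] for every grid time u (eqs. 10-11)."""
    F = np.concatenate(([0.0], np.cumsum(w) * dt))  # F[k] ~ integral of w over [0, k*dt]
    k = np.arange(len(w))
    lo = np.maximum(0, k - int(round(delta_t / dt)))
    return F[k] - F[lo]

def temporal_filters(dt=2e-5, t_max=0.1, delta_t=50e-3,
                     tau_g=1e-3, tau_c=10e-3, tau_s=9e-4, n=5, w_c=0.75):
    """Return the grid t and sampled W, R_c, R_s (dt and t_max are illustrative choices)."""
    t = np.arange(0.0, t_max, dt)
    e_g = gamma_filter(t, tau_g, n)
    e_c = gamma_filter(t, tau_c, 0)                 # exponential filter = gamma filter, n = 0
    e_s = gamma_filter(t, tau_s, 0)
    # (6): E_G * (delta_0 - w_c E_C) = E_G - w_c (E_G * E_C)
    W = e_g - w_c * causal_conv(e_g, e_c, dt)
    R_c = windowed_integral(W, dt, delta_t)
    R_s = windowed_integral(causal_conv(W, e_s, dt), dt, delta_t)
    return t, W, R_c, R_s

t, W, R_c, R_s = temporal_filters()
```

Both weights start from zero at $u = 0$, and the cumulative-sum formulation keeps the cost linear in the number of time samples once the convolutions have been computed.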

The proof of Proposition 1 is omitted due to lack of space.
According to Proposition 1, the activation degree $A(x,t)$
depends on all the images $f_i(x)$ occurring before time $t$, but
the following corollary shows that, under some mild assumptions,
the activation degree can be processed image per image. Before
presenting this corollary, let us introduce a useful lemma.

Lemma 1. The function $\phi(x,u)$ is a continuous and infinitely
differentiable function for all $u > 0$ and all
$x \in \mathbb{R}^2$ such that:

$$\lim_{u \to +\infty} \phi(x,u) = \phi(x), \quad \forall x \in \mathbb{R}^2, \tag{12}$$

where $\phi(x)$ is a DoG filter independent of $u$.

Fig. 2: Temporal filters $R_c(u)$ and $R_s(u)$.

Lemma 1 shows that $\phi(x,u)$ converges toward a constant
spatial DoG filter as $u$ tends to infinity. This convergence is
illustrated in Fig. 2. Let $\varepsilon > 0$ be a small positive
constant. According to Lemma 1, there exists
$t_c = t_c(\varepsilon) > 0$ such that

$$|\phi(x,u) - \phi(x)| < \varepsilon, \quad \forall u > t_c,\ \forall x \in \mathbb{R}^2. \tag{13}$$

The following corollary comes from Proposition 1 together
with Lemma 1.


Corollary 1. Let $\varepsilon > 0$ and assume that the parameters
of $\phi(x,u)$ are chosen such that $t_c(\varepsilon) < \Delta t$.
Let $t$ be such that $g_1 \leq t < g_{N+1}$ and $i$ be the unique
integer such that $g_i \leq t < g_{i+1}$. Then, the activation
degree $A(x,t)$ in (8) can be approximated by

$$\tilde{A}(x,t) = f_i(x) * \phi(x, t - g_i) + \sum_{j \,:\, g_j + t_c < t} f_j(x) * \phi(x), \tag{14}$$

where $|A(x,t) - \tilde{A}(x,t)| < \eta$ with $\eta$ a small
positive constant directly proportional to $\varepsilon$. It
follows that:

$$\tilde{A}(x,t) = A_i(x,t) + B_i(x), \tag{15}$$

where $A_i(x,t)$ is the filtered version of $f_i(x)$:

$$A_i(x,t) = \phi(x, t - g_i) * f_i(x), \tag{16}$$

and $B_i(x)$ is defined recursively by $B_1(x) = 0$ and

$$B_{i+1}(x) = B_i(x) + A_i(x), \tag{17}$$

with

$$A_i(x) = \phi(x) * f_i(x).$$

The interest of Corollary 1 is to show that, at time $t$, the
activation degree $A(x,t)$ can be approximated by
$\tilde{A}(x,t)$, which only depends on $t$ via $A_i(x,t)$. The
remaining term $B_i(x)$ in (15) corresponds to the $A_j(x,t)$'s
for $j < i$ occurring before time $g_i$ in (8). Since Lemma 1
yields $A_j(x, t_c) \approx A_j(x)$, the remaining term $B_i(x)$
is (almost) time independent and it does not convey any
information on $f_i(x)$. The factor $A_j(x)$ is transmitted at the
end of the time interval $[g_j, g_{j+1}]$; hence, in practice, it
is not useful to compute the full convolution $\tilde{A}(x,t)$ in
(15). It is sufficient to apply the non-SPAM filter on image
$f_i(x)$ during the time interval $[g_i, g_i + t_c]$ and to
transmit $A_i(x,t)$.

The above corollary is crucial because it enables the
simplification and representation of the non-SPAM filter as a
block of time-varying DoG kernels. DoG kernels have been
extensively studied in the past [6, 10, 11] and they can be
processed efficiently. Fig. 3 shows the non-SPAM decomposition of
one image, say $f_i(x)$, which is extracted from a video stream.
We have selected 5 different time samples
$t \in \{t_1, t_2, t_3, t_4, t_5\}$ of $A_i(x,t)$ where
$g_i \leq t_j < g_{i+1}$.
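
By linearity, (9) and (16) give $A_i(x,t) = w_c R_c(u)\,(G_{\sigma_c} * f_i)(x) - w_s R_s(u)\,(G_{\sigma_s} * f_i)(x)$ with $u = t - g_i$, so every time sample re-weights the same two Gaussian blurs. The sketch below illustrates this under several assumptions that are not taken from the paper: periodic boundary conditions (FFT-based convolution), hypothetical values for $\sigma_c$ and $\sigma_s$, hypothetical function names, and temporal weights passed in as plain arrays.

```python
import numpy as np

def gaussian_transfer(shape, sigma):
    """Frequency response of a unit-gain Gaussian of standard deviation sigma (pixels)."""
    fy = np.fft.fftfreq(shape[0])[:, None]          # cycles per sample, vertical
    fx = np.fft.fftfreq(shape[1])[None, :]          # cycles per sample, horizontal
    return np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))

def blur(img, sigma):
    """Circular Gaussian blur of a 2-D image computed in the Fourier domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(img) * gaussian_transfer(img.shape, sigma)))

def non_spam_layers(img, r_c, r_s, w_c=0.75, w_s=1.0, sigma_c=1.0, sigma_s=3.0):
    """Stack of A_i(x, u_j) for the temporal weights r_c[j] = R_c(u_j), r_s[j] = R_s(u_j).

    sigma_c and sigma_s are illustrative values, not the ones used in the paper.
    """
    center = blur(img, sigma_c)                     # G_sigma_c * f_i, computed once
    surround = blur(img, sigma_s)                   # G_sigma_s * f_i, computed once
    return np.stack([w_c * a * center - w_s * b * surround
                     for a, b in zip(r_c, r_s)])

# Usage with made-up weights for 5 time samples of a random 64x64 image.
image = np.random.default_rng(0).random((64, 64))
layers = non_spam_layers(image,
                         r_c=np.array([0.1, 0.4, 0.7, 0.9, 1.0]),
                         r_s=np.array([0.0, 0.2, 0.5, 0.8, 1.0]))
```

With this factorisation the cost of the decomposition is two blurs per image plus one weighted difference per time sample, whatever the number of samples $m$.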

4. NON-SPAM FRAME 

The goal of this section is to recall that the non-SPAM filter,
when it is applied on each image separately, is invertible and
permits us to reconstruct the video image per image. For this
reason, we establish that the non-SPAM filter has a frame
structure according to frame theory [11, 12]. Let us consider the
image $f_i(x)$ over the time interval $[g_i, g_{i+1}]$. As
underlined in the discussion following Corollary 1, the decoder
receives a stream of images $A_i(x,t)$ described by (16) and its
goal is to reconstruct $f_i(x)$ from $A_i(x,t)$ for
$t \in [g_i, g_{i+1}]$.

Fig. 3: Non-SPAM filter applied to $f_i(x)$ at 5 time samples.

For numerical purposes, we need to discretize the non-SPAM filter
in space and in time. Let $x_1, \ldots, x_n \in \mathbb{R}^2$ be a
set of spatial sampling points and
$t_1, \ldots, t_m \in [g_i, g_{i+1}]$ be temporal sampling points.
Let $u_1 = t_1 - g_i, \ldots, u_m = t_m - g_i$ denote the elapsed
times between the $t_j$'s and $g_i$. Without any loss of
generality, it is assumed that the $u_j$'s are the same whatever
the considered time interval $[g_i, g_{i+1}]$. As a consequence,
the continuous spatial convolution $A_i(x,t)$ is approximated by
the discrete convolution:

$$A_i(x_k, t_j) = \phi(x_k, t_j - g_i) \circledast f_i(x_k) = \sum_{p=1}^{n} \phi(x_k - x_p, u_j)\, f_i(x_p) = A_i(x_k, u_j),$$

for all $1 \leq k \leq n$ and $1 \leq j \leq m$. Let
$\varphi_{k,j}$ be the row vector of $\mathbb{R}^n$ defined by

$$\varphi_{k,j} = \big(\phi(x_k - x_1, u_j), \ldots, \phi(x_k - x_n, u_j)\big). \tag{18}$$

Let us denote the sampled version of the image $f_i(x)$:

$$\mathbf{f}_i = \big(f_i(x_1), \ldots, f_i(x_n)\big), \tag{19}$$

and $\|\mathbf{f}_i\|$ be the Euclidean norm of $\mathbf{f}_i$.
In our previous work, we have proven that the family of vectors
$\Phi$ is a frame [13, 14].

5. RETINAL MODEL SYNTHESIS

The optimal reconstruction of the input video, image per image,
is possible when we provide to the decoder all the coefficients of
the non-SPAM image. At time $t_m$ ending the interval
$[g_i, g_{i+1}]$, the optimal estimate $\hat{\mathbf{f}}_i$ of
$\mathbf{f}_i$ is given by:

$$\hat{\mathbf{f}}_i = (\Phi^T \Phi)^{-1} \Phi^T A_i, \tag{20}$$

where $M^{-1}$ denotes the inverse of a matrix $M$ and $M^T$
denotes its transpose. With a slight abuse of notation, $\Phi$ is
a family of vectors, arranged as a matrix of size $nm \times n$,
given by $\Phi = (\varphi_{k,j} : 1 \leq k \leq n,\ 1 \leq j \leq m)$,
and $A_i$ is another vector which is given by
$A_i = (A_i(x_k, u_j) : 1 \leq k \leq n,\ 1 \leq j \leq m)$. The
dual frame, which is necessary to have a perfect decoding at time
$t_m$ [11, 12], is $(\Phi^T \Phi)^{-1} \Phi^T$. Instead of
computing the above matrix operator, which is time consuming and
resource demanding, we can easily show that (20) is a solution of
the following least squares problem:

$$\hat{\mathbf{f}}_i = \arg\min_{\mathbf{f}_i \in \mathbb{R}^n} \sum_{j=1}^{m} \left\| \phi_j \circledast \mathbf{f}_i - A_{ij} \right\|^2, \tag{21}$$

where the vectors $\phi_j$ and $A_{ij}$ are defined by:

$$\phi_j = \big(\phi(x_1, u_j), \ldots, \phi(x_n, u_j)\big), \tag{22}$$

$$A_{ij} = \big(A_i(x_1, u_j), \ldots, A_i(x_n, u_j)\big). \tag{23}$$

Hence, the estimate $\hat{\mathbf{f}}_i$ is computed by using a
gradient descent algorithm.

6. NUMERICAL RESULTS

We have captured a video with a rate of 20 images per second and
we have applied the non-SPAM filter on each image of size 64x64
using the software tool MATLAB.

Fig. 4: Reconstruction of the 400th image of the video stream.
(a) Original image. (b) Reconstructed image.

All the parameters which are related to the lifespan of the
non-SPAM filter are tuned according to the video rate (see
Corollary 1): $\Delta t = 50$ msec,
$\tau_C = 10 \times 10^{-3}$ sec, $\tau_S = 9 \times 10^{-4}$ sec,
$\tau_G = 1 \times 10^{-3}$ sec, $n = 5$, $w_c = 0.75$, and
$w_s = 1$. The rest of the parameters, which are related to the
spatial domain, $\sigma_c, \sigma_s, w_c, w_s$, are biologically
plausible and they have been obtained by modeling the
center-surround structure of the retinal cells' receptive
fields [1, 11].

The reconstruction results generated by the total number of
coefficients are almost perfect and they are illustrated in
Fig. 4. Hence, there is some redundancy within the transmitted
coefficients. For this reason, we propose to apply the
Rank-Order-Coding (ROC) model proposed in [1]. The ROC model is
traditionally used to convert an analog signal into a rank order
of electrical impulses (spikes). The spike which is first emitted
has been caused by a rapid excitation because of a strong signal.

In this study, we apply the ROC model only with respect to the
informative coefficients which are generated by the non-SPAM
transformation but we do not aim to produce any spikes. We sort
the coefficients of each decomposition layer in descending order
and we select a percentage of coefficients in $A_i$, keeping the
ones with the highest energy and discarding the rest. In Fig. 5,
we have selected one image of the video stream and we have used 5
different percentages of the highest values of the activation
degree of 5 DoG kernels.
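
The selection step described above, followed by a least-squares synthesis by gradient descent, can be sketched on a 1-D toy signal. This is only an illustration under stated assumptions: "highest energy" is read as largest magnitude, the DoG blocks are circulant with hypothetical weights and widths, the rows of `Phi` are ordered by time sample and then by position, and all function names are invented for the example.

```python
import numpy as np

def keep_top_fraction(layer, fraction):
    """Keep the `fraction` largest-magnitude coefficients of one layer, zero the rest."""
    flat = layer.ravel()
    k = int(round(fraction * flat.size))
    out = np.zeros_like(flat)
    if k > 0:
        idx = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
        out[idx] = flat[idx]
    return out.reshape(layer.shape)

def gaussian_1d(n, sigma):
    """Periodic, unit-sum Gaussian sampled on n points and centred on sample 0."""
    d = np.arange(n)
    d = np.minimum(d, n - d)                        # circular distance to sample 0
    g = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    return g / g.sum()

def dog_matrix(n, a, b, sigma_c=1.0, sigma_s=3.0):
    """Circulant n x n matrix of the DoG a*G_c - b*G_s (hypothetical parameters)."""
    kernel = a * gaussian_1d(n, sigma_c) - b * gaussian_1d(n, sigma_s)
    return np.stack([np.roll(kernel, k) for k in range(n)])

def least_squares_gd(Phi, A, n_iter=500):
    """Gradient descent on ||Phi f - A||^2, started from f = 0."""
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)  # 1/L, L = Lipschitz constant of the gradient
    f = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        f -= step * 2.0 * Phi.T @ (Phi @ f - A)
    return f

rng = np.random.default_rng(0)
n = 64
f = rng.standard_normal(n)                          # toy "image"
weights = [(0.2, 0.1), (0.6, 0.4), (1.0, 0.8)]      # made-up (w_c R_c(u_j), w_s R_s(u_j)) pairs
blocks = [dog_matrix(n, a, b) for a, b in weights]
Phi = np.vstack(blocks)                             # (n*m) x n analysis operator
layers = [B @ f for B in blocks]

for frac in (0.2, 0.6, 1.0):
    A = np.concatenate([keep_top_fraction(layer, frac) for layer in layers])
    f_hat = least_squares_gd(Phi, A)
    print(frac, np.linalg.norm(f - f_hat) / np.linalg.norm(f))
```

The step size $1/L$ guarantees that the least-squares cost does not increase from one iteration to the next; the number of iterations is an arbitrary choice and would need tuning for real images.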



Fig. 5: Reconstruction based on the ROC model. (a) Original image
$f_{1000}(x)$. (b) 20%, (c) 40%, (d) 60%, (e) 80% and (f) 100% of
coefficients.


7. CONCLUSION 

This paper proposes to study the analysis and the synthesis 
of a retinal-inspired filter which is applied on video streams. 
We have shown that this filter can be applied image per image 
without any significant loss of information. Our future goal is 
to adapt this filter into a bio-inspired codec which is going to 
produce an event-based code. 